WWW: West Coast Wildfires and Weather

Tutorial: Analysis of Wildfire and Weather Data

Mark Keller
CMSC320
Final Tutorial Project
29 November 2018

Table of Contents

  1. Introduction
  2. Data
    2.1 West Coast Wildfire Data
    2.2 Supplementary Weather Data
    2.3 Merging the data
  3. Exploration
    3.1 Univariate Analysis
    3.2 Multivariate Analysis
  4. Hypothesis Testing
  5. Regression Analysis
  6. Classification Analysis
  7. Conclusions

1. Introduction

For the past few weeks, the Camp Fire in Northern California has been a deadly and destructive force, recieving enormous media coverage. This deadly wildfire that began on November 8, 2018 has left 85 people dead, 249 missing, and destroyed thousands of buildings. It has resulted in major air quality issues, causing many to wear masks or stay indoors to protect their health. But this is not the first time a major wildfire has ravaged the west coast of the United States.

In this tutorial, we will analyze west coast wildfire data from 2010 to 2016, available through the University of California Irvine Data Science Initiative GitHub organization here. This data has been obtained through the NASA Moderate Resolution Imaging Spectroradiometer (MODIS) satellite. We will merge this wildfire data with daily weather data obtained from NOAA Climate Data Online tool.

We will begin by loading the data and performing exploratory data analysis before diving deeper into regression analysis and classification analysis using basic machine learning methods. These data science methods will allow us to look at past trends and make inferences about future west coast wildfires.

Wildfire Prevention

2. Data

We will begin by importing Python packages that will help us to load and wrangle the data:

  • pandas: a package that allows us to build data frames, a tabular data format that can be thought of like a spreadsheet
  • numpy: a package containing matrix and vector mathematical functions
In [1]:
import pandas as pd
import numpy as np

2.1 West Coast Wildfire Data

The wildfire dataset is contained in a comma-separated values file (CSV) located on GitHub. Pandas has a read_csv function that allows us to load a data frame from a CSV over the web. We will specify the data types of each column using a python dict to ensure that pandas loads the data using the data types that we will want to use in our analysis.

In [2]:
data_url = 'https://raw.githubusercontent.com/UCIDataScienceInitiative/Climate_Hackathon/master/west_coast_fires/fires.csv'
dtype = {
    'confidence': float,
    'day': int,
    'frp': float,
    'lat': float,
    'lon': float,
    'month': int,
    'year': int,
    'x': float,
    'y': float,
    'dayofyear': int,
    'vdp': float,
    'temp': float,
    'humidity': float
}
df = pd.read_csv(data_url, dtype=dtype)
df.head()
Out[2]:
id confidence day frp lat lon month year x y dayofyear vpd temp humidity
0 10122 0.61 1 6.0 32.997 -110.765 1 2010 1336.853783 110.667 0 0.232254 271.10 55.8
1 10123 0.72 1 11.1 32.995 -110.777 1 2010 1335.747633 110.445 0 0.232254 271.10 55.8
2 13460 0.57 2 6.3 33.650 -113.735 1 2010 1056.936097 183.150 1 0.333371 272.70 43.6
3 13461 0.62 2 10.2 32.358 -114.922 1 2010 950.464277 39.738 1 0.158511 274.70 76.8
4 22139 0.56 3 14.4 33.415 -110.860 1 2010 1325.030866 157.065 2 0.159304 268.62 63.5

We can use the shape variable to learn how many fire observations are contained in this dataset:

In [3]:
df.shape
Out[3]:
(216232, 14)

The next step is to determine what this wildfire data means so that we can perform exploration. Luckily, the README for the GitHub repository provides us with a description of each column:

  • id: unique identifier of the fire in the overall context of the world dataset
  • confidence: how much confidence the satellite has that this is actually a fire detection (percent)
  • day: the day of the month
  • frp: Fire Radiative Power, the strength of the fire
  • lat: latitude
  • lon: longitude
  • month: month of the year
  • year: year
  • x: x position in a uniformly-spaced grid
  • y: y position in a uniformly-spaced grid
  • dayofyear: day of the year (from 0 to 364 or 365 for leap years)
  • vpd: Vapor Pressure Deficit, the difference between the moisture in the air and the amount of moisture the air could hold
  • temp: temperature (degrees Kelvin)
  • humidity: humidity (percent)

To be extra cautious, we can restrict our analysis to those fires with high confidence.

In [4]:
confidence_threshold = 0.8
df = df.loc[df['confidence'] > confidence_threshold]

We can observe the size of our data after filtering by confidence threshold.

In [5]:
df.shape
Out[5]:
(97016, 14)

We will also want to exclude fires with a strength of zero.

In [6]:
frp_threshold = 1
df = df.loc[df['frp'] > frp_threshold]

We can again observe the size of our data, this time after filtering by frp threshold.

In [7]:
df.shape
Out[7]:
(96821, 14)

The wildfire dataset README notes that the same fire may be split into separate observations if it spreads over multiple locations or last multiple days. We can try to remedy this by consolidating fire observations which:

  • are close together
  • span multiple consecutive days.

According to the Wildfire page on Wikipedia, forest fires can spread as fast as 10.8 km per hour. Using this information, we can consider that in a single day, it is possible that a fire could spread 259 kilometers:

In [8]:
wildfire_max_daily_spread = 10.8*24
wildfire_max_daily_spread
Out[8]:
259.20000000000005

To easily identify consecutive days, we will first add a column called dayofdataset which is similar to dayofyear but does not restart at the end of the year.

In [9]:
min_year = df['year'].min()
max_year = df['year'].max()
df['dayofdataset'] = df.apply(lambda row: (row['year'] - min_year) * 365 + row['dayofyear'], axis='columns')

Next we will take the difference of dayofdataset for each each row with the previous row. The resulting groups of rows that are 1 day or 0 days apart will be the candidates for our consolidation procedure.

In [10]:
df['dayofdataset_diff'] = df['dayofdataset'].diff()
df.head()
Out[10]:
id confidence day frp lat lon month year x y dayofyear vpd temp humidity dayofdataset dayofdataset_diff
6 40048 0.94 4 66.9 33.418 -110.862 1 2010 1324.823718 157.398 3 0.290486 270.57 42.5 3.0 NaN
22 64733 0.84 5 37.3 34.387 -111.540 1 2010 1255.567410 264.957 4 0.233602 266.02 34.7 4.0 1.0
40 103425 0.91 8 60.1 35.101 -112.093 1 2010 1200.490337 344.211 7 0.157593 261.97 39.4 7.0 3.0
41 103426 0.88 8 47.1 35.111 -112.095 1 2010 1200.247792 345.321 7 0.157593 261.97 39.4 7.0 0.0
42 103427 0.86 8 40.4 35.109 -112.109 1 2010 1198.984931 345.099 7 0.157593 261.97 39.4 7.0 0.0

Based on this date difference column, we can give each group of consecutive dates a different index, so that we can use groupby to iterate over each group of dates.

In [83]:
df = df.reset_index(drop=True)
df['date_group_index'] = np.nan
curr_date_group_index = 1
for index, row in df.iterrows():
    if pd.isna(row['dayofdataset_diff']) or row['dayofdataset_diff'] <= 1.0:
        df.at[index, 'date_group_index'] = curr_date_group_index
    else:
        curr_date_group_index += 1
df.head(10)
Out[83]:
id confidence day frp lat lon month year x y dayofyear vpd temp humidity dayofdataset dayofdataset_diff closest_station closest_station_dist date_group_index fire_group_index
0 40048 0.94 4 66.9 33.418 -110.862 1 2010 1324.823718 157.398 3 0.290486 270.57 42.5 3.0 NaN USW00023183 105.971337 1.0 96823
1 64733 0.84 5 37.3 34.387 -111.540 1 2010 1255.567410 264.957 4 0.233602 266.02 34.7 4.0 1.0 USW00003103 84.972498 1.0 96823
2 103425 0.91 8 60.1 35.101 -112.093 1 2010 1200.490337 344.211 7 0.157593 261.97 39.4 7.0 3.0 USW00003103 39.102676 NaN 2
3 103426 0.88 8 47.1 35.111 -112.095 1 2010 1200.247792 345.321 7 0.157593 261.97 39.4 7.0 0.0 USW00003103 39.160723 2.0 96824
4 103427 0.86 8 40.4 35.109 -112.109 1 2010 1198.984931 345.099 7 0.157593 261.97 39.4 7.0 0.0 USW00003103 40.449813 2.0 96824
5 139328 0.92 11 69.7 39.143 -120.661 1 2010 435.128728 792.873 10 0.222111 274.20 66.3 10.0 3.0 USW00023225 15.534144 NaN 5
6 143581 0.89 12 31.9 35.131 -119.792 1 2010 499.372080 347.541 11 0.466023 277.97 45.9 11.0 1.0 USW00023155 74.987188 3.0 96825
7 143585 0.83 12 21.1 34.396 -118.957 1 2010 574.206285 265.956 11 0.313014 268.96 30.1 11.0 0.0 USW00023136 23.070951 3.0 96825
8 153322 0.83 12 26.1 34.395 -110.590 1 2010 1342.780135 265.845 11 0.209012 267.14 46.4 11.0 0.0 USW00093027 111.859573 3.0 96826
9 165302 0.83 13 34.6 33.185 -115.468 1 2010 897.353798 131.535 12 0.355731 269.00 20.8 12.0 1.0 USW00003104 80.816374 3.0 96826

Here, we will iterate through each "date group" and identify each "fire group" contained within, based on computation of pairwise fire distances. We will use python's itertools to determine the fire pairs in a group.

In [69]:
import itertools

We can use the Haversine Formula to compute distances between pairs of fires, and determine the closest weather station to each fire.

In [14]:
# Adapted from this StackOverflow post: https://stackoverflow.com/a/4913653
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

Those fires that are within 259 kilometers per day of each other will be placed into the same "fire group" and be given the same fire_group_index value.

In [86]:
max_date_group_size = 200
In [ ]:
curr_max_fire_group_index = df.shape[0]
df['fire_group_index'] = np.arange(curr_max_fire_group_index)
curr_max_fire_group_index += 1

for date_group_index, date_group_df in df.groupby(['date_group_index']):
    if pd.notnull(date_group_index) and date_group_df.shape[0] < max_group_size:
        fire_pairs = itertools.combinations(list(date_group_df.index.values), 2)
        print(date_group_index, len(list(fire_pairs)))
        group_pairs = {}
        for index_a, index_b in fire_pairs:
            dist = haversine(
                date_group_df.loc[index_a]['lon'], 
                date_group_df.loc[index_a]['lat'], 
                date_group_df.loc[index_b]['lon'],
                date_group_df.loc[index_b]['lat']
            )
            if dist <= wildfire_max_daily_spread*(1+np.absolute(date_group_df.loc[index_a]['dayofdataset'] - date_group_df.loc[index_b]['dayofdataset'])):
                if index_a in group_pairs.keys():
                    df.at[index_b, 'fire_group_index'] = group_pairs[index_a]
                elif index_b in group_pairs.keys():
                    df.at[index_a, 'fire_group_index'] = group_pairs[index_b]
                else:
                    curr_max_fire_group_index += 1
                    group_pairs[index_a] = curr_max_fire_group_index
                    group_pairs[index_b] = curr_max_fire_group_index
                    df.at[index_a, 'fire_group_index'] = curr_max_fire_group_index
                    df.at[index_b, 'fire_group_index'] = curr_max_fire_group_index

df.head(10)

2.2 Supplementary Weather Data

Using the NOAA Climate Data Online (CDO) service, we can obtain daily climate data from 2010 to 2016, from weather stations in the west coast. The CDO search tool allows us to select NOAA stations of interest and the types of data we want to download for a specified time period. In this case, I have manually selected 75 weather stations spread throughout the west coast region.

Along with the location (latitude and longitude) of each weather station, I selected to download data comprising the following variables:

Air Temperature

  • TAVG: Average Temperature
  • TMAX: Maximum temperature
  • TMIN: Minimum temperature

Precipitation

  • MDPR: Multiday precipitation total (use with DAPR and DWPR, if available)
  • DAPR: Number of days included in the multiday precipitation total (MDPR)
  • PRCP: Precipitation
  • SNWD: Snow depth
  • SNOW: Snowfall

Sunshine

  • PSUN: Daily percent of possible sunshine for the period
  • TSUN: Total sunshine for the period

Wind

  • AWND: Average wind speed
  • WDF2: Direction of fastest 2-minute wind
  • WDF5: Direction of fastest 5-second wind
  • WSF2: Fastest 2-minute wind speed
  • WSF5: Fastest 5-second wind speed
  • PGTM: Peak gust time
  • FMTM: Time of fastest mile or fastest 1-minute wind

Weather Type: binary indication of specific weather events

  • WT01: Fog, ice fog, or freezing fog (may include heavy fog)
  • WT02: Heavy fog or heaving freezing fog (not always distinguished from fog)
  • WT03: Thunder
  • WT04: Ice pellets, sleet, snow pellets, or small hail
  • WT05: Hail (may include small hail)
  • WT06: Glaze or rime
  • WT07: Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction
  • WT08: Smoke or haze
  • WT09: Blowing or drifting snow
  • WT10: Tornado, waterspout, or funnel cloud
  • WT11: High or damaging winds
  • WT13: Mist
  • WT14: Drizzle
  • WT15: Freezing drizzle
  • WT16: Rain (may include freezing rain, drizzle, and freezing drizzle)
  • WT17: Freezing rain
  • WT18: Snow, snow pellets, snow grains, or ice crystals
  • WT19: Unknown source of precipitation
  • WT21: Ground fog
  • WT22: Ice fog or freezing fog

After selecting data, the CDO tool sends an email with our requested data to download. I have done this and place the downloaded file in this GitHub repository so that we can easily load it with pandas.

In [11]:
cdo_df = pd.read_csv('data/1572799.csv')
cdo_df.head()
Out[11]:
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND DAPR FMTM MDPR ... WT11 WT13 WT14 WT15 WT16 WT17 WT18 WT19 WT21 WT22
0 USW00024133 BURLEY MUNICIPAL AIRPORT, ID US 42.5416 -113.7661 1266.1 2010-01-01 13.65 NaN 1009.0 NaN ... NaN 1.0 NaN NaN 1.0 NaN 1.0 1.0 NaN NaN
1 USW00024133 BURLEY MUNICIPAL AIRPORT, ID US 42.5416 -113.7661 1266.1 2010-01-02 9.84 NaN 206.0 NaN ... NaN 1.0 NaN NaN 1.0 NaN 1.0 NaN NaN NaN
2 USW00024133 BURLEY MUNICIPAL AIRPORT, ID US 42.5416 -113.7661 1266.1 2010-01-03 5.59 NaN 2359.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 USW00024133 BURLEY MUNICIPAL AIRPORT, ID US 42.5416 -113.7661 1266.1 2010-01-04 NaN NaN 228.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 USW00024133 BURLEY MUNICIPAL AIRPORT, ID US 42.5416 -113.7661 1266.1 2010-01-05 NaN NaN 2340.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 44 columns

And taking note of the size of this data frame:

In [12]:
cdo_df.shape
Out[12]:
(188890, 44)

2.3 Merging the data

To be able to match this weather data with our wildfire data, we will look at the weather from the closest station to each fire observation.

We will first need to map each station to its location.

In [13]:
cdo_loc_df = cdo_df.drop_duplicates(subset=['STATION'])[['STATION', 'LATITUDE', 'LONGITUDE']].set_index('STATION', drop=True)
cdo_loc_df.head()
Out[13]:
LATITUDE LONGITUDE
STATION
USW00024133 42.5416 -113.76610
USW00024131 43.5666 -116.24050
USW00093075 38.7500 -109.76278
USW00024132 45.7880 -111.16080
USW00093230 38.8983 -119.99470

Using the haversine formula from above, we can write a function that returns the ID of the closest weather station and the distance (in kilometers), given the coordinates of a fire:

In [15]:
def get_closest_station(fire_lon, fire_lat):
    station_distance_series = cdo_loc_df.apply(lambda row: haversine(row['LONGITUDE'], row['LATITUDE'], fire_lon, fire_lat), axis='columns')
    return (station_distance_series.idxmin(), station_distance_series.min())

Now we can apply this function along the entire wildfire dataframe, creating two new columns:

In [16]:
df['closest_station'], df['closest_station_dist'] = zip(*df.apply(lambda row: get_closest_station(row['lon'], row['lat']), axis='columns'))
df.head()
Out[16]:
id confidence day frp lat lon month year x y dayofyear vpd temp humidity dayofdataset dayofdataset_diff closest_station closest_station_dist
6 40048 0.94 4 66.9 33.418 -110.862 1 2010 1324.823718 157.398 3 0.290486 270.57 42.5 3.0 NaN USW00023183 105.971337
22 64733 0.84 5 37.3 34.387 -111.540 1 2010 1255.567410 264.957 4 0.233602 266.02 34.7 4.0 1.0 USW00003103 84.972498
40 103425 0.91 8 60.1 35.101 -112.093 1 2010 1200.490337 344.211 7 0.157593 261.97 39.4 7.0 3.0 USW00003103 39.102676
41 103426 0.88 8 47.1 35.111 -112.095 1 2010 1200.247792 345.321 7 0.157593 261.97 39.4 7.0 0.0 USW00003103 39.160723
42 103427 0.86 8 40.4 35.109 -112.109 1 2010 1198.984931 345.099 7 0.157593 261.97 39.4 7.0 0.0 USW00003103 40.449813

The mean distance from fires to the closest weather station is approximately 79 kilometers, or 49 miles:

In [17]:
df['closest_station_dist'].mean()
Out[17]:
78.8566412646424

Turning our attention to the weather dataframe now, we will convert its date values into a dayofyear column so that we can easily cross-reference from the wildfire dataset to determine the weather on the day of a fire.

We will need to import python's datetime to help with date formatting.

In [18]:
import datetime
In [19]:
def cdo_date_to_dayofyear(cdo_date):
    date = datetime.datetime.fromisoformat(cdo_date).date()
    return int(date.strftime('%-j'))

Now we can apply this conversion function to all of the date values in the CDO weather dataset:

In [20]:
cdo_df['dayofyear'] = cdo_df['DATE'].apply(cdo_date_to_dayofyear)

3. Exploration

3.1 Univariate Analysis

To begin our exploration, we can look at the distributions of some of our variables.

To visualize data, we can load helpful python visualization packages:

  • matplotlib
  • seaborn
In [21]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib import cm
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
import seaborn as sns

# matplotlib notebook configuration options
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

First we will look at the distribution of the confidence variable. This variable represents how confident we are that the satellite has correctly classified a specific image as a wildfire. This distribution will give us some insight into how much we can trust the results of our analysis. Note that we have already restricted our analysis to data points with a confidence level higher than 0.8.

In [22]:
_ = plt.title("Distribution of Confidence")
_ = plt.xlabel("Confidence")
_ = plt.ylabel("Count")
_ = plt.xlim(0.8, 1.0)
_ = plt.hist(df['confidence'].values)

Based on this distribution and our filtering procedure, we can be confident in our analysis that the majority of our data points truly represent wildfires.

Next, we can look at the Fire Radiative Power variable's distribution to get a sense of how strong fires were between 2010 and 2016.

In [23]:
_ = plt.title("Distribution of Fire Radiative Power")
_ = plt.xlabel("Fire Radiative Power")
_ = plt.ylabel("Count")
_ = plt.hist(df['frp'].values)

This plot is not very helpful. We can tell that the overwhelming amount of our data points come from fires with low FRP values, but there is clearly some data that is not easily visible on this plot. Perhaps if we log scale the y-axis it will be more informative.

In [24]:
_ = plt.title("Distribution of Fire Radiative Power")
_ = plt.xlabel("Fire Radiative Power")
_ = plt.ylabel("log(Count)")
_ = plt.hist(df['frp'].values, log=True)

We can create these same histograms for the vpd, temp, and humidity variables:

In [25]:
_ = plt.title("Distribution of Vapor Pressure Deficit")
_ = plt.xlabel("Vapor Pressure Deficit")
_ = plt.ylabel("Count")
_ = plt.hist(df['vpd'].values)
/Users/markkeller/miniconda3/envs/cmsc320-final-project/lib/python3.7/site-packages/numpy/lib/histograms.py:754: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
/Users/markkeller/miniconda3/envs/cmsc320-final-project/lib/python3.7/site-packages/numpy/lib/histograms.py:755: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
In [26]:
_ = plt.title("Distribution of Temperature")
_ = plt.xlabel("Temperature (degrees Kelvin)")
_ = plt.ylabel("Count")
_ = plt.hist(df['temp'].values)
In [27]:
_ = plt.title("Distribution of humidity")
_ = plt.xlabel("Humidity (percent)")
_ = plt.ylabel("Count")
_ = plt.hist(df['humidity'].values)

Next, we can look at the distribution over time using the year variable. To create this histogram we will want to bin by year rather than allowing matplotlib to choose the bins automatically, so we can use pandas groupby to get the counts per year explicitly:

In [28]:
year_df = df.groupby(['year']).size().reset_index(name='count')
year_df.head()
Out[28]:
year count
0 2010 5302
1 2011 7334
2 2012 24686
3 2013 13861
4 2014 12416

Note that now that we have the explicit counts we can use the matplotlib bar rather than hist.

In [29]:
_ = plt.title("Distribution of Year")
_ = plt.xlabel("Year")
_ = plt.ylabel("Count")
_ = plt.bar(year_df['year'].values, year_df['count'].values, 0.65)

Next, we can look at the distribution over the year using the dayofyear variable.

In [30]:
_ = plt.title("Distribution of Day of Year")
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Count")
_ = plt.hist(df['dayofyear'].values)

This plot shows that our distribution is fairly symmetric and unimodal, centered around day 225/365. But maybe the distribution will be more clear if we are able to view this data by month rather than day.

datetime can be used to format our month numbers into month names by creating a date object with our month number and then specifying an output date string format that only consists of the month name.

In [31]:
def month_num_to_name(month_num):
    date = datetime.date(2018, month_num, 1)
    return date.strftime('%B')

We will obtain counts of fires per month using pandas groupby.

In [32]:
month_df = df.groupby(['month']).size().reset_index(name='count')
month_df['month'] = month_df['month'].apply(month_num_to_name)
In [33]:
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Date")
_ = plt.xlabel("Month of Year")
_ = plt.ylabel("Count")
bar_width = 0.65
_ = plt.bar(month_df['month'].values, month_df['count'].values, bar_width)

Finally we can look at the entire span of time from 2010 to 2016 and count the fires observed each month.

In [34]:
year_month_df = df.groupby(['year', 'month']).size().reset_index(name='count')
year_month_df['year-month'] = year_month_df.apply(lambda row: ("%s %i" % (month_num_to_name(row['month']), row['year'])), axis='columns')
year_month_df.head()
Out[34]:
year month count year-month
0 2010 1 18 January 2010
1 2010 2 49 February 2010
2 2010 3 172 March 2010
3 2010 4 213 April 2010
4 2010 5 147 May 2010
In [35]:
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Year and Month")
_ = plt.xlabel("Year and Month")
_ = plt.ylabel("Count")
_ = plt.bar(year_month_df['year-month'].values, year_month_df['count'].values, 0.65)
_ = plt.xticks(year_month_df['year-month'].values[::3], rotation=70)

3.2 Multivariate Analysis

Because this wildfire data also contains location data, we can view it on a map as well. We will begin our multivariate analysis by looking at the location data in relationship to other variables.

To do so, we will load python packages to assist with geographical data visualization

  • cartopy: geographical visualization toolkit for matplotlib
In [36]:
import cartopy.crs as ccrs
#from cartopy.io.img_tiles import OSM
from cartopy.io.img_tiles import Stamen
imagery = Stamen(style='terrain-background')

To determine the extent (ranges) of our maps, we can determine the range of the latitude and longitude variables in the wildfire dataset.

In [37]:
min_lon = df['lon'].min()
max_lon = df['lon'].max()
min_lon -= (max_lon - min_lon) * 0.2
min_lat = df['lat'].min()
max_lat = df['lat'].max()

For our visualization, we will plot fires from different years using different colors. To do so, matplotlib's colormap module can be used to map each year to a color of the rainbow.

We will make this conventient for ourselves later by creating a function that will generalize this mapping to any sorted array.

In [38]:
def get_cmap_dict(vals, cmap_name='rainbow', numeric=True):
    cmap = cm.get_cmap(cmap_name)
    if numeric:
        # assume vals already sorted
        min_val = vals[0]
        max_val = vals[-1]
        return dict(zip(vals, [cmap((val - min_val) / (max_val - min_val)) for val in vals]))
    else:
        len_vals = len(vals)
        return dict(zip(vals, [cmap(val / len_vals) for val in range(len_vals)]))
In [39]:
years = list(df['year'].unique())
year_to_color_map = get_cmap_dict(years)

Because the plot could become cluttered and difficult to interpret, we will first restrict the fires to those with the top strength (frp) for each year.

In [40]:
n_per_year = 30
df_top_frp = df.sort_values(['frp'], ascending=False)
df_top_frp_year_groupby = df_top_frp.groupby(['year'])

Next we will need to determine the minimum and maximum strength values, which will help us to make a legend.

In [41]:
min_frp = df_top_frp_year_groupby.min()['frp'].min()
max_frp = df_top_frp_year_groupby.max()['frp'].max()
frp_intervals = np.linspace(max_frp, min_frp, 4, endpoint=False)

Now we will create the plot.

In [42]:
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=imagery.crs)
_ = ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
_ = ax.add_image(imagery, 8)
frp_scaling_factor = 0.003
for year, df_top_frp_year in df_top_frp_year_groupby:
    for index, row in df_top_frp_year.head(n_per_year).iterrows():
        _ = plt.plot(
            row['lon'], 
            row['lat'], 
            marker='o', 
            color=year_to_color_map[row['year']], 
            markersize=(frp_scaling_factor*row['frp']), 
            alpha=0.5, 
            transform=ccrs.Geodetic()
        )
_ = ax.set_title("West Coast Wildfires 2010-2016, Strongest %i per Year" % n_per_year)

year_legend_elements = [Patch(facecolor=year_color, label=year) for year, year_color in year_to_color_map.items()]
year_legend = ax.legend(handles=year_legend_elements, loc='upper left', title='Year')
_ = plt.gca().add_artist(year_legend)

frp_legend_elements = [Line2D([0], [0], marker='o', color=(0,0,0,0), label=np.floor(frp_val), markerfacecolor='#C0C0C0', markersize=frp_val*frp_scaling_factor) for frp_val in frp_intervals]
_ = ax.legend(handles=frp_legend_elements, loc='lower left', labelspacing=2, columnspacing=2, title='Fire Radiative Power')

Next we can perform multivariate analysis of the rest of our data and ask ourselves whether we see relationships between any of our variables.

In the following plot we will look at temperature over time.

In [43]:
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", data=df)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Temperature (kelvin)")

Next, we can incorporate year by coloring each point based on year.

In [44]:
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", hue="year", data=df, linewidth=0)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Temperature (kelvin)")

It might be easier to look at the means for each year.

In [45]:
year_dayofyear_df = df.groupby(['year', 'dayofyear']).mean().reset_index()
In [46]:
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Mean Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", hue="year", data=year_dayofyear_df, linewidth=0)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Mean Temperature (kelvin)")
In [47]:
year_month_mean_df = df.groupby(['year', 'month']).mean().reset_index()
In [48]:
month_tick_formatter = matplotlib.ticker.FuncFormatter(lambda x, p: month_num_to_name(int(x)) if x >= 1 and x <= 12 else "")
In [49]:
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Mean Temperature by Month")
ax = sns.lineplot(x="month", y="temp", hue="year", data=year_month_mean_df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Mean Temperature (kelvin)")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)

Because we have many more variable pairs to evaluate, we can start to look at correlations.

In [50]:
corr_df = df.corr()
_ = plt.figure(figsize=(11, 9))
_ = sns.heatmap(corr_df, annot=True, vmin=-1, vmax=1)
In [51]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Temperature")
_ = sns.scatterplot(x="temp", y="vpd", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Vapor Pressure Deficit")
In [52]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Humidity")
_ = sns.scatterplot(x="humidity", y="vpd", data=df)
_ = plt.xlabel("Humidity")
_ = plt.ylabel("Vapor Pressure Deficit")
In [53]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Temperature")
ax = sns.scatterplot(x="temp", y="frp", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Fire Radiative Power")
In [54]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Humidity by Temperature")
_ = sns.scatterplot(x="temp", y="humidity", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Humidity")
In [55]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Humidity")
_ = sns.scatterplot(x="humidity", y="frp", data=df)
_ = plt.xlabel("Humidity")
_ = plt.ylabel("Fire Radiative Power")
In [56]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Vapor Pressure Deficit")
_ = sns.scatterplot(x="vpd", y="frp", data=df)
_ = plt.xlabel("Vapor Pressure Deficit")
_ = plt.ylabel("Fire Radiative Power")
In [57]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Confidence")
_ = sns.scatterplot(x="confidence", y="frp", data=df)
_ = plt.xlabel("Confidence")
_ = plt.ylabel("Fire Radiative Power")
In [58]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Month")
ax = sns.scatterplot(x="month", y="frp", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Fire Radiative Power")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
In [59]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Humidity by Month")
ax = sns.scatterplot(x="month", y="humidity", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Humidity")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
In [60]:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Month")
ax = sns.scatterplot(x="month", y="vpd", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Vapor Pressure Deficit")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
In [61]:
sns.lmplot(x="temp", y="frp", data=df);
/Users/markkeller/miniconda3/envs/cmsc320-final-project/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval